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Abstract. Given a set of vectors (the data) in a Hilbert space H, we prove 
^v^j the existence of an optimal collection of subspaces minimizing the sum of the 

square of the distances between each vector and its closest subspace in the 
i-O collection. This collection of subspaces gives the best sparse representation 

for the given data, in a sense defined in the paper, and provides an optimal 
MH model for sampling in union of subspaces. The results are proved in a general 

setting and then applied to the case of low dimensional subspaces of M. N and 

to infinite dimensional shift-invariant spaces in L 2 (R d ). We also present an 
l | iterative search algorithm for finding the solution subspaces. These results 

are tightly connected to the new emergent theories of compressed sensing and 
*- \ dictionary design, signal models for signals with finite rate of innovation, and 

the subspace segmentation problem. 

•s 

B 

1. INTRODUCTION 

CN 

A new paradigm for signal sampling and reconstruction recently developed by Lu 
and Do [LD07J starts from the point of view that signals live in some union of 
subspaces M. = U, e /^, instead of a single vector space M. — V such as the space of 
band-limited functions also known as the Paley- Wiener space. This new paradigm 
^ is general and includes (when M. = V) the classical Shannon sampling theory and 

q its extensions |AG01j . as well as sampling of signal with finite rate of innovation 

£ — . (see e.g., [MV05 DVB07]). In the new framework, when we have more than one 

subspace, the signal space model M = U ie /V; is non-linear and the techniques for 
reconstructing a signal / <G Uig/V^ from its samples {.f(xj)}j are involved and the 
reconstruction operators are non-linear. 

Since for each class of signals the starting point of this new theory is the knowledge 
of the signal space M. = U,; e /V^, the first step for implementing the theory is 
to find an appropriate signal model M. = Ui^iVi from a set of observed data 
T = {fx, . . . , f m }. For the classical sampling theory, the problem of finding the 
shift-invariant space model M. — V from a set of observed data has been studied 
and solved in ACHMR03 , ACHM07J. For the new sampling paradigm, the problem 
consists in proving the existence and finding subspaces V% , • • • , Vi , of some Hilbert 



00 
O 

o 



X 

5-H 



Date: February 7, 2008. 

2000 Mathematics Subject Classification. Primary 41A65, 42C15 ; Secondary 68P30, 94A20. 

Key words and phrases. Sampling, Sparsity, Compressed Sensing, Frames. 

The research of Akram Aldroubi is supported in part by NSF Grant DMS-0504788. The 
research of Carlos Cabrelli and Ursula Molter is partially supported by Grants: PICT 15033, 
CONICET, PIP 5650, UBACyT X058 and X108. 



1 



2 



AKRAM ALDROUBI, CARLOS CABRELLI, AND U.MOLTER 



space Ti that minimize the expression 

m 

e(f ) {l' ll ...^}) = ^minrf 2 (/„y j ), (1.1) 
i—i 

over all possible choices of / subspaces belonging to an appropriate class of sub- 
spaces of Ti. Here T = {fi, ■ ■ ■ , f m } C Ti is a set of observed data and d is the 
distance function in Ti. 

It is well known that the problem of sampling and reconstruction of signals with 
finite rate of innovation is closely related to the developing theory of compressed 
sensing (see e.g., |CRT06I ICR061 ICT06I lDeV07l IDon06l IRSV06] and the references 
therein) . Compressed sensing proposes to find a vector x € from the knowledge 
of the values, when applied to x, of a relatively small set of functionals {ipk ■ 
k = 1, . . . ,p} (where p « N). Obviously, the problem of finding x from the set 
{iJk = (x, ipu) ■ k = 1, . . . ,p} is ill-posed. However, it becomes meaningful if x is 
assumed to be sufficiently sparse. 

A typical assumption of sparsity is that x has at most n non-zero components 
(ll^llo _• n )> where n < 2p << N. As a consequence of this assumption of sparsity, 
the vector x belongs to some union of subspaces, each of which is generated by ex- 
actly n vectors from the canonical basis of M. N . In matrix formulation this problem 
can be stated as follows: find x G R w with ||x||o < n from the matrix equation 
y = Ax where A is a p x N matrix and y is a given vector in W. 

A related problem consists in finding an approximation to the vector y using a 
sparse vector x. Formally, this problem can be stated as follows: find min||x||o 

subject to the constraint \\Ax — y|| 2 < e for some given e. The above two problems, 
their analysis, extensions, and efficient algorithms for finding their solutions can 
be found in |A EB06aJ IAEB06bl IBDDW07I ICRT06I ICR06l ICT061 IDeV07l IDon06l 
IGN031 ITro04] and the references therein. 

If in the above problems the matrix A is also an unknown to be found together with 
the set of unknown vectors {x t : i = 1, . . . , m} C M. N , then these problems become 
the problems of finding a dictionary A from the data {yi : i = 1,... ,m} C M p 
obtained by sampling the sparse vectors {xi : i — l,...,m} C R N see e.g., 
|AEB06bl IAEB06al IGN03] , In this context, the columns of A are called atoms 
of A. Under appropriate assumptions on the data and dictionary, the problem has 
a unique solution up to a permutation of the columns of A AEBOGb. AEB06aJ. 
Finding the solution to this problem by exhaustive methods is computationally 
intractable, but the K-SVD algorithm described in |AEB06a] provides a computa- 
tionally effective search algorithm. 

The problem of finding the signal model for signals with finite rate of innovation 
consists of finding a set A4 = U; e /Vi, formed by subspaces Vi that are infinite di- 
mensional, in general, but usually structured, e.g., each Vi is a shift-invariant space. 



However, the signal modeling problem as described by ( 1.1 1 is closely related to the 
dictionary design problem for sparse data, described in the previous paragraph. 

To see this relation, let us formulate the dictionary design problem as follows: given 
a class of signals, determine if there exists a dictionary of small size, such that each 
of the signals can be represented with minimal sparsity. 
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More precisely, assume that we have a class of to signals, where to is a very large 
number. We want to know whether there exists a dictionary, such that every signal 
in the class is a linear combination of at most n atoms in the dictionary. Clearly, 
to make the problem meaningful and realistic the length of the dictionary should 
be small compared with to. 

It follows, that if for a given set of data such a dictionary exists, then the data 
can be partitioned into subsets each of which belongs to a subspace of dimension 
at most n (i.e. to the subspace generated by the atoms that the signal uses in its 
representation). That is, each subset of the partition can be associated to a low 
dimensional subspace. 

Conversely, if our class of signals can be partitioned into I subsets, such that the 
signals in each subset belong to a subspace of dimension no bigger than n, then 
by choosing a set of generators from each of the subspaces, we can construct a 
dictionary of length at most In with the property that each of the signals can be 
represented using at most n atoms in the dictionary. 

This suggests that the problem of finding a dictionary where the signals have sparse 
representation can be solved by finding a small collection of low dimensional sub- 
spaces containing our signals, and viceversa. 



So, we will say that the class of signals is (l,n)-sparse (see also Definition 4.3 1 if 
there exist I subspaces of dimension at most n, such that the signals in our class 
belong to the union of these I subspaces. From the above discussion, it is clear that 
if our data is (I, n)-sparse then there exists a dictionary of length at most In. 

A related problem is the subspace segmentation problem for a set of signals in 
R (see for example [MDHW07I IMYDF07] ). This problem occurs in the context 
of segmentation clustering and classification, and consists in finding whether there 
exist / subspaces of dimension at most n, such that the signals in the class belong to 
the union of these I subspaces. The subspace segmentation problem has important 
applications in computer vision, image processing and other areas of engineering, 
and it has recently been solved using algebraic methods and algebraic geometrical 
tools [VMS05 . The method for solving it (known as the Generalized Principle 
Component Analysis (GPCA)) has also been extended to deal with moderate noise 
in the data |VMS05| . Moreover, the uniqueness problem has been addressed in 
|M YDF07J ■ 

Now assume that for a given I and n our data is not (I, n)-sparse. In that case 
we prove that there still exists a collection of optimal subspaces providing the 
needed sparsity. More precisely, if e > is given, we determine that there exists a 
collection of I subspaces of dimension at most n such that the sum of the squares 
of the distance of each signal to the union of the subspaces (i.e., the total error) 



is not larger than e, (see formula (1.1 1). In that case we will say that our data is 
(I, n, e) -sparse. 

As before it is clear that if our data is (I, n, e)-sparse, then a dictionary of length 
at most In exists such that every signal in our class can be approximated using 
a linear combination of at most n atoms from the dictionary, with total error not 
larger than e. 

Note that this definition of sparsity is an intrinsic property of the data and the 
space where they belong to, and does not depend on any fixed dictionary. 
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A relevant and important question is then, given a class of signals and a small 
number n, which is the minimun possible e such the data is (/, n, e)-sparse? 

In this paper we present a general scheme that allows us to solve the problem 
described in (1.1 1, thereby finding the signal model for the new signal sampling 
paradigm described in [LD07J, finding a new method for solving the segmentation 
subspace problem that is optimal in the presence of noise [VMS05], and solving 
the (I, n, e)-sparsity problem (in the sense defined above) for a given set of data, in 
different contexts. Specifically, given a set T of m vectors and numbers l,n such 
that n,l < m, we prove the existence of no more than I subspaces of dimension no 
bigger than n that provide the minimum e such that the vectors in T are (l,n, e)- 
sparse. When the minimum e is zero, the data is (I, n)-sparse. We also give an 
iterative search algorithm to find the solution subspaces. 

It is important to remark here that an optimal solution can have less than I sub- 
spaces, and the dimensions of the subspaces can be less than n. Since the minimiza- 
tion we consider is over unions of no more than I subspaces, where the dimension 
of the subspaces is no bigger than n, some of the optimal solutions for a given (/, n) 
(that is, some of the solutions that give the smallest e) will yield the minimum l < I 
such that the data is (Iq, n, e)-sparse, that is / is set to be just an upper bound for 
the number of allowable subspaces. Furthermore, the number n constraining the 
dimension of the subspaces is also only an upper bound, that is, an optimal solution 
can have subspaces of dimension strictly less than n. 



1.1. Organization and Contribution. In this paper we solve the abstract prob- 
lem described in (1.1). Unlike prior work in the subspace segmentation problem 
(see e.g., [VMS05 and the references therein), it does not assume that the data 
T comes from union of subspaces, but instead it finds the best union of subspaces 
that matches the data, and therefore it is well adapted for subspace segmentation 
in the presence of noise, and for the problem of sparsity and dictionary design in 
compressed sensing. Moreover, the setting includes finite and infinite dimensional 
spaces, and therefore can be used to solve the signal modeling problem described 
in |LD07| . The subspaces that are sought are not restricted to be orthogonal, or 
with equal dimensions or with trivial intersection, and there can be any number of 
subspaces up to a prescribed number I. 

In Section [2] we formally state as Problem 1 the question described by (1.1 1 and 
introduce a general abstract scheme for solving this specific problem, together with 
a lemma that will provide the tool for an algorithm described in a later section. 

In Section [3] we consider the case of the Hilbert space L 2 (M. d ) and where the infinite 
dimensional subspaces are shift-invariant. We show that the general theory in 
Section |2] applies to this case and thereby we solve (1.1 1 in this situation. As a 



consequence we show how the signal modeling problem is solved in Section 3.4 



In Section [4] we consider the finite dimensional case M. N and particularize the so- 
lution found in Section [2] to this case. This allows us to tie our method to the 
problem of finding sparsity models and the problem of finding optimal dictionaries 
as described in Section l4~2l 



In Sectio n [5| w e present an iterative search algorithm for finding the solution to 
Problem (1.1). We prove that the algorithm terminates in finitely many steps. 
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The algorithm is iterative and switches between subspace estimation and data seg- 
mentation in a way that is similar to the subspace segmentation methods and the 
K-SVD method described in |AEB06bl IAEB06al IMYDF07I IVMS05] and the refer- 
ences therein. 

2. Abstract Hilbert Space Case 

In this section we will introduce an abstract scheme that, in particular, contains 
the problems mentioned in the introduction. This scheme is much more general 
and can be used in many other situations. 

In this theoretical setting we will prove the existence of optimal solutions and 
provide the mathematical background for the algorithms to find these solutions. 

We will start by describing the basic ingredients for that setting and introducing 
some required notation. First we will define the class of subspaces that we will use 
for the minimization. 

Let Ti be a Hilbert space. For x, y S Ti let us denote by d(x, y) = \\x — y\\n- Given 
a finite subset T C Ti and a closed subspace V of Ti, we denote by E{T , V) the 
total distance of the data set T to the subspace V, i.e. 

E(F,V)= £d 2 (/,n (2.1) 
We set E(T, V) = for T = and any subspace V of Ti. 

Let C be a family of closed subspaces of Ti containing the zero subspace. We will 
say that C has the Minimal Approximation Property (MAP) if for any finite set T 
of vectors in Ti there exist a subspace Vq £ C that minimizes E(T, V) over all the 
subspaces V € C. That is, 

E(J r ,V ) = mmE(J r ,V)<E(J r ,V), VVeC. (2.2) 



Any subspace Vq £ C satisfying (2.2 1 will be called an optimal subspace for T . Note 
that if T = then every subspace in C is optimal. We will choose the zero subspace 
in that case. For the rest of this section we will assume that the class C has the 
Minimal Approximation Property. 

Next, since we are interested in models that are union of subspaces, we will arrange 
the subspaces in finite bundles that will be our main objects, and define the distance 
(error) between a bundle and a set of vectors. 

To do this, let us fix m, I £ N with 1 < I < m and let T = {/i, f m } be a finite 
set of vectors in TL. 

Define 23 to be the set of sequences of elements in C of length I, i.e. 

05 = 05(0 = {V = {Vi,...,Vi}:VieC,l<i<l}. 

We will call these finite sequences bundles. For V 6 05 with V = {Vi, V/}, we 
define, 



e(F,V)='£mm<P(f,V i ). (2.3) 

SeT - J - 

Remark. Note that e[T, V) is computed as follows: For each / 6 T find the space 
Vjff\ in V closest to /, compute d 2 (f, Vj(j)), and then sum over all values found 
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FIGURE 1. Illustration for the objective function in Problem]!] A data 
set consists of three points T = {/i,/2,/3} in K 2 . A) Value of the 
objective function is e = d(fi, V2) + d(fz, Vi) + d(fa, Vi); and B) Value 
of the objective function is e = V2) + d(/2, V2) + d{fz, Vi). Note 

that the configuration of Vi , V% in Panel A forced a partition of the data 
into Pi = {/1} and P2 = {f2,fz\, while the configuration in B forced 
the partition Pi = {/1, f%] and P2 = {fz} for the same data. 



by letting / run through T (see Figure [TJ. Also note that e(J r , V) is a non-linear 
function of T . 

Hence, for the problems described in the introduction, what we want is to minimize 
e over all possible bundles of subspaces. This is formulated in the following problem. 

Problem 1. 

(1) Given a finite set T C Ti, minimize e(^ r , V) over V 6 03. That is, find 

inf{e(J",V) : Ve 33}. (2.4) 

(2) Find a bundle Vo €03 (if it exists) such that 

e{T, V ) = inf{e(^, V) : V e 35}. (2.5) 
Any Vq g 03 £/ia£ satisfies ( |2.5[ ) wiZZ &e called a solution to Pro&Zem[7] 



We will show (Theorem 2.2 1 that Problcm[T]can be solved, i.e. for a given data set 



T, there does exist a bundle Vo that minimizes e(J-, V). Moreover, we propose an 
algorithm to find this optimal bundle. 

If we set I — 1, Ti. — R d , and C = £„ to be the set of all subspaces of dimension 
smaller (or equal) than n, then Problem[l]reduces to the classical least squares prob- 
lem. This last problem has been studied extensively (see Figure |2]for an illustration 
in K 2 ), and it can be solved using the well-known Singular Value Decomposition 
(SVD) (see e.g., [Sch07l IEY55] ). 

Before stating the main results, we need to give some definitions and set some 
notation. 
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FIGURE 2. Illustration for the objective function in Problem [T] Same 
data set T — {fi, f2, f:i} as in Figure [l] but for a single subspace V. 
This objective function is the classical least squares cost function. 



We will denote by II — II; the set of all /-sequences P = {T\, ■ ■ ■ , T{\ of subsets of 
T satisfying the property that for all 1 < i,j < I, 

f,Cf, T = U l s=1 T s , and Ti D Tj = for i ^ j. 

Note that we allow some of the elements of P € II to be the empty set. By abuse 
of language we will still call the elements of II; partitions (of T) . 

For P £ n ; , P= {T^-.^Ti} and V e 03, V = {V u -,Vi} we define, 

l 

r(p,v) = '£E(r i ,v i ). (2.6) 

i=l 

So r measures the error between a fixed partition P and a fixed bundle V. 

The following relations between partitions in II/ and bundles of length I in C will 
be relevant for our analysis. 

Given a bundle V e 23, V = {Vi,...,Vz}cC, we can partition the set T into a best 
partition P = {J-\, Ti}, by grouping together into Ti, the vectors in T that are 
closer to a given subspace V$ than to any other subspace Vj , j ^ i (see Figure [TJ . 
However, there are situations in which a vector / G T is at equal distance from two 
or more subspaces from the bundle. Hence, there may be more than one partition 
associated to each bundle, and there is a subset f2/(V) C n; of best partitions in n; 
naturally associated to V defined by 

P = {J 7 !, Ti} € n; is a member of f2j(V) if it satisfies 
/ € T. 3 implies that d(f, Vj) < d(f, V h ),h = 1, I. 

Conversely, since C has the MAP, given a partition P = Ti}, we can define 

a best bundle Vp = {V±, Vi} G 23 by finding (for each i) the space Vl that 



minimizes (2.1) for the given Ti. However, there are situations in which, for some 



Ti, there are more than one subspace Vi that minimizes (2.1). Hence, there is a 
subset W(P) C 23 of best bundles associated to T defined by 
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Vp = {Vi, ...,VJ} E *B is a member of W(P) if V* is an optimal 



subspace for T% (in the sense of (2.2)) for each i = 1, i 



In what follows when we refer to a best partition associated to a bundle V we will 
mean, any element in fi;(V). Similarly, when we talk of a best bundle associated 
to a partition P, this will mean an element in W(P). 

We also consider the set of all pairs (P, Vp), where P £ Hi and Vp G W(P). We 
will say that a pair (Po, Vp ) is T -minimal if 

r(P ,v Po ) <r(pv P ) (2.7) 

for all such pairs. 

Note that when trying to compute e(jF, V), for each / 6 we first have to find 
the subspace Vjrf) in V that is closest to / and then compute d 2 (f,Vj(f)) (see 
remark after the definition of e(J-, V) and Figure [lj. While for T, a partition is 
given and we just compute the distance of each function to its corresponding space 
(not the closest one necessarily). The surprising fact is that e and T can indeed be 
compared, as the following lemma shows. In addition, this result will later give us 
the key to obtain an algorithm for Problem [l] 

Lemma 2.1. Let (Po, Vp ) be a T -minimal pair. Then we have 

e(f,Vp )=r(Po,Vp ). (2.8) 

Proof. It is clear that e(T, V Po ) < T(P 0l V Po ). 

For the other inequality, if Vp = {Vi, V{\ then for any P G f^(Vp ) we have 

n i 

e{T, Vp ) = J2 ,545, *Ui> ^0 = r ( P ' Vps,). ( 2 - 9 ) 



i=l 



In addition, r(P,Vp Q ) > r(P,V P ), with V P G W(P). But by the minimality 
of r(Po, Vp ) given by hypothesis, we have that T(P, Vp) > T(Po, Vp ), and the 
lemma follows. □ 

We are now ready to prove the following theorem which shows that we can solve 
Problem □ 

Theorem 2.2. Let H be a Hilbert space, m, I positive integers with I < m and 
T = {/i, f m } a set of vectors in H. Then 

(1) There exists a bundle Vo G 03 that solves Problem^ for the data T , that 
is, 

e{T, V ) = inf{e(^, V) : V e 35}. 

(2) //(Po,Vp ) is a r -minimal pair, then all the elements o/W(Po), are soZw- 
izons to Problem [7J 

(3) Furthermore, if Vo is a solution to Problem [7] i/ien £/iere exists Pq G 11; 
smc/i toa< Vo G W(Po), ("i.e. (Po, Vo) is a T-minimal pair. 

In other words, Theorem |2.2| states that Problem [T] has a solution for every finite 
set of vectors T — {fx, fm} C H and every Z > 1 if and only if C has the MAP 
property. One direction of the theorem is trivial. The interesting implication is 
that if Problem [T] can be solved for any T and / = 1 then it can be solved for any 
T and any I > 1. 
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Proof. We will prove that if (Po, Vp ) is a T- minimal pair, then 
e{T,V Po ) <e(.F,V), VVe». 

For this, let us choose an arbitrary V G 05. We have that for each P G £1;(V) 

r(P,V)=e(^,V). 

Clearly T(P,V P ) < r(P,V), for each V P S W(P). 
Because of the minimality of r(Po, Vp ), we have 

r(Po,v Po )<r(p,v P ). 

As a consequence of Lemma |2.1| we know that then 

r(P 0) Vp o ) = e(f,Vp o ), 

which proves, 

e(f,V Po )<e(f,V). 

This shows that if (Po,Vp ) is a T-minimal pair, then each bundle Vp solves 
Problcm[T|for the data T. Since the total number of pairs is finite, then there exist 
minimal pairs. This proves parts (1) and (2) of the Theorem. 

For part (3) let V G 05 be a solution to Problem [l] i.e. e(T, V ) < e(T, V), V V G 
05. Consider P G ft;( v o) and let V Pf) G W(P ). Then, since P G «;( v o) and by 
the minimality of Vo we have 

r(P ,V ) = e(^,V ) < e(f,V Po ) < r(P ,Vp ). 

Therefore, r(P ,V ) < r(P ,V Po ), but by definition of T, r(P ,V Fo ) < r(P ,V) 
for any V G 95. So, 

r(P ,V ) = r(Po,Vp ), and VoGW(Po). 
Moreover, (Po, Vo) is r-minimal since 

r(P , Vo) = eCF.Vo) < e(f,V P ) < T(P, V P ). 
This completes the proof of the Theorem. □ 

Remark. If < h < l 2 , then for any V G 9J(Zi), V = {Vi, . . . , V^} the bundle 
V = {Vi, . . . , V/ 17 {0}, . . . , {0}} belongs to QS(Z 2 ) and therefore, we have 

e(F,V Po (h))>e(F,V Po (h)). 

So the error decreases (or at least does not increase) when I (the number of sub- 
spaces) increases. Note that in case that the number of subspaces equals the number 
of data, the error is zero, since we can pick for each data signal the subspace spanned 
by itself. 

It is important to remark here that optimal bundles can have the zero subspace as 
some of its components. So, if l is the number of subspaces that have dimension 
greater than zero, in some optimal bundle Vo, then the bundle with Iq components 
obtained after the / — Iq zero components are removed from Vo, is also an optimal 
bundle for the Problem [l] when Q3(Z) is replaced by Q3(/o)- Thus as mentioned 
in the introduction, the number I is simply a set to be an a priori upper bound 
on the number of subspaces, and the optimal solution(s) can have any number of 
subspaces Iq < I. 
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3. The Shift-Invariant Space Case and Optimal Nonlinear Signal 

Models 

In this section we will apply the theory of Section [2] to the Hilbert space L 2 (K d ) . 
In order to do that we will select a family of subspaces with the Minimal Approxi- 
mation Property. We will describe in what follows the necessary setting. 

We begin by recalling the definition of frames and some of their properties (see for 
example jCasOOl IChr03l IGroOH IHW96] ) . 

Let TL be a Hilbert space and {ujjig/ a countable subset of TL. The set {u.i} ie j is 
said to form a frame for TL if there exist q, Q > such that 

g||/|| 2 <^|</,^>| 2 <Q||/|| 2 , V/gW. 

If q = Q, then {ui}i<=i is called a tight frame, and it is called a Parseval frame if 
q = Q = l. ' 

If {ui}i£i is a Parseval frame for a subspace W of a Hilbert space TL, and if a E TL, 
then the orthogonal projection of a onto W is given by: 

^w(a) = ^2(a,u l )u i . (3.1) 

iei 

Thus, a Parseval frames acts as if it were an orthonormal basis of W, even though 
it may not be one. 

3.1. Shift-Invariant Spaces. In this paper, a shift-invariant space will be a sub- 
space of L 2 (R d ) of the for m: 

:= closure^ span{ipi(x — k) : i = 1, . . . , n, k € Z d } (3.2) 

where $ = tp n } is a set of functions in L 2 (R d ). The functions ipi, if2, ■ ■ ■ , <p n 

are called a se£ 0/ generators for the space 5 = 5'( < I ) ) = S(ip\, . . . , and any such 
space S is called a, finitely generated shift-invariant space (FSIS) (see e.g., |Bow00] ) . 
These spaces are often used as standard signal and image models. For example, 
if n = 1, d — 1 and <fi(x) = sinc(cc), then the underlying space is the space of 
band- limited functions (often used in communications). 

Finitely generated shift-invariant spaces, can have different sets of generators. The 
length of an FSIS S is, 

l(S) = min{€ G N : 3 <f\, . . . , <pt G S with S — S((p±, . . . , <fe)}- 

If S = {0}, we set l(S) — 0. We will denote by C n the set of all shift-invariant spaces 
with length less than or equal to n. That is, an element in C n is a shift-invariant 
space that has a set of s generators with s < n. 

3.2. The Minimal Approximation Property for SIS. In [ACHM07j it was 
proven that £„ has the MAP. More precisely, 

Theorem 3.1. Let T — {/1, . . . , f rn } be a set of functions in L 2 (M. d ). Then 
there exists V £ C n such that 

m tci 

^Vi-^fit<YJh-yv>hT> vy'ec (3.3) 
i=i »=i 
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Here 3V denote the orthogonal projection onto the subspace V. 
Furthermore, an explicit description of an optimal space (that is not necessarily 
unique) and an estimation of the error, was obtained in [ACHM07] . as is described 
below. Let us call, 

m 

£ (T, n) = ram ]T 1 1 h - 7 V , f\\ 2 . (3.4) 

" i=i 

To compute the error £ (JF, n) we need to consider the Gramian matrix Gjr of J- = 
{fx, ■ ■ ■ , f m }- Specifically, the Gramian G$ of a set of functions $ = {<p±, ■ ■ ■ , <p n } 
with elements in L 2 (M. d ) is defined to be the n x n matrix of Z d -periodic functions 



[G^(aj)}ij = ^2 <^( w + k )<?j(v + we R d \ (3.5) 

keZ d 

where (p^ denotes the Fourier transform of and (pi denotes the complex conjugate 
of (pi. 

The next theorem produces a set of generators for an optimal space V £ C n and 
provides a formula for the exact value of the error. 



Theorem 3.2. [ACHM07 Under the same assumptions as in Theorem 3.1 let 



Ai(w) > A2(w) > ■ ■ • > \ m (u>) be the eigenvalues of the Gramian Gjt(lo). Then 

(1) The eigenvalues \i(u>), 1 < i < m are 7L d -periodic, measurable functions in 
L 2 ([0,l] d ) and 

m - 

£{T,n)= / \{u)dw. (3.6) 

i=n+1 [o,i]^ 

— 1/2 

(2) Let Ei := {ui : Aj(w) f= 0}, and define ofuS) = X i (uj) on Ei and 
&i(u)) — on Ef. Then, there exists a choice of measurable left eigenvectors 
yi(u),...,y n (w) with = j/j m )*, i = l,...,n, associated with the 
first n largest eigenvalues of Gj?(uj) such that the functions defined by 

m 

<Pi(<j) = ai(uj)^2yij(oj)fj(ui), i = 1, . . . ,n, uj £ R d (3.7) 
j=i 

are in L 2 (R d ). 

Furthermore, the corresponding set of functions <j> = {<£>i, ■ • ■ } fn} * s a 
set of generators for an optimal space V and the set {(fi(- — k),k £ Z , i = 
1, . . . , n} is a Parseval frame for V . 



Note that (3.7) says that in particular the generators of the optimal space are 
^2(Z)-linear combinations of the integer translates of the data T . 

3.3. Best Approximation by Bundles of SIS. Let T — {/i,...,/ m } be func- 
tions in L 2 (R d ) and n a positive integer smaller than m. 

The result in Theorem |3.1| says that the class L n has the Minimal Approximation 
Property. 

Define S = {{Si, Si} : Sj £ C n } to be the set of bundles of SIS in C n . Now we 
can apply Theorem |2.2| to conclude that 



12 



AKRAM ALDROUBI, CARLOS CABRELLI, AND U.MOLTER 



Theorem 3.3. Let T = {fx, f m } vectors in L 2 (M. d ), then there exist a bundle 
S = {S°,...,S°} G S such that 



e(T, So) = AUi^) < E ™9, Sj) (3.8) 

* — ' 1<j<1 * — ' l=JSt 

Z=l 2=1 

ower aZZ bundles S = {Si, . . . , Si} G 5. 



Let Pq — {J- 1, ...,J-*i} be a best partition of T associated to the optimal bundle 
S = {S? , . . . , S, } (i.e. P = {^?, . . . ,rf} G fiz(So), is such that G :F° implies 

d^-.S?) <<*(/,•, S° k ), k=l, ...,/). 

Using Theorem 3.2 for each ft, = 1, and such that J 7 ^ ^ 0, a set of generators 



forming a Parseval frame can be obtained for the optimal space S° in terms of the 
singular values and singular vectors of the gramian G-p h associated to the subset 

T° h . Furthermore a formula for the minimum error i?(.F°, SjJ) is given in terms of 
the singular values of G^-o . Therefore, thanks to Lemma 



r (see (2.6)), e(T, So) can be computed exactly. 



2.1 



and the definition of 



3.4. Optimal Signal Models. If it is known a priori that a class of signals belong 
to a union of shift-i nvari ant spac es U l i=1 Si each of length no larger than n, then 



combining Theorems 3.2 and 3.3 we can find the signal model U i=1 Si exactly and 
the generators of each space Si. This solution includes the case where the signal 
class consists of a single shift-invariant space solved in [ACHMR03J . 

When the data T is corrupted by noise, or if the true signals are not from a union of 
shift-invariant spaces U l i=1 Si of length no larger than n, but we still wish to model 
the class of signals by such a union, then we can use the bundle Sq — {S^, ■ ■ ■ , S ; } 



found in Theorem 3.3 to obtain an optimal signal model U^S^ compatible with the 
observed data. The optimal signal model UiS® is a union of infinite dimensional 
spaces S°, but each S° is a shift-invariant space that can be generated by at most 
n frame generators <I> = {</? ,ij • ■ ■ > ^cMil' s « — n > * = 1> • ■ • > 9> 1 — ' ( we on ^y use 
the spaces Sf that have length larger than 0). Each signal fj G T can now be 
modeled by its orthogonal projection /? = c Ps fj onto its closest space S®. which 
consists of countably many but generally infinite linear combinations of atoms, i.e., 

//= EE* 
P =i 

Note that if the data T is corrupted by noise, then the optimal model can be used 
as a denoising method. Other applications of the optimal signal model are those of 
learning, data segmentation, and classification. 



4. The finite dimensional case R , sparse representations, and 

OPTIMAL DICTIONARIES 

In this section we will consider Problem [T] for the case in which the Hilbert space 
is (in applications, usually one thinks of iV as being very large). In this case, 
our data T — {/i, • • • ,/m} are vectors in R w . Let us denote by £ n the set of all 
subspaces of dimension smaller (or equal) than n. To see that £„ has the MAP, 
we will recourse to some well-known results about Singular Value Decomposition 
(SVD) and in particular, the Eckart- Young Theorem. 
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Let us briefly recall the SVD decomposition, (for a detailed treatment see for ex- 
ample [HJ85] , or the Appendix of |ACHM07Q . Let A G R Nxm with columns 
{di, . . . , a m }, and let r be the rank of A. One can obtain its SVD as follows. Con- 
sider the matrix A* A G IR mxm . Since A 1 A is self-adjoint and positive semi-definite, 
its eigenvalues Ai > A2 > • ■ ■ > A m are nonnegative and the associated eigenvectors 
yi , . . . , y m can be chosen to form an orthonormal basis of ]R m . Note that the rank 
r of A corresponds to the largest index i such that \i > 0. The left singular vectors 
ui, . . . ,u r can then be obtained from 

rn 

Va7u 4 = Ayi, that is u, t = X i 1/2 Vijdj. (1 < i < r). 

j'=i 

Here yi = (yn, yim) 1 - The remaining left singular vectors it r +i, . . . , u m can be 
chosen to be any orthonormal collection of m — r vectors in M. N that are perpen- 
dicular to span {ai, . . . , a m }. One may then readily verify that 

m 

i^^^A 1 ^, (4.1) 

fe=i 

where U G R Nxm is the matrix U = {in, . . . , u m }, A 1 / 2 = diag(Aj /2 , A„ /2 ), and 
Y = { yi , ...,y m } G M mxm with [/*[/ = J m = y'y = YYK 

The following theorem of Schmidt (cf. |Sch07j ) (usually coined as Eckart-Young 
Theorem [EY36 ) shows that our set £„ has the MAP. (We again denote by 3V the 
orthogonal projection onto the space V.) 

Theorem 4.1 (Eckart-Young). Let . . . , f m } be a set of vectors in M and r = 
dim (span{fi, . . . , f m })- Suppose that the associated matrix A — . . . , f m }> has 
SVD A = UA 1 / 2 ^ and that < n < r. IfW = span{u!, . . .,u n }, then 

n 

{3V/i, . . . , Twfm} = ^ u iVi = A n 
i=l 

and 

m m 

YW^-^warWl <Y\\ a i- V vai\\l, V^e£„. (4.2) 

i=l i=l 

Furthermore, the space W is unique if X n +i ^ A n . In addition, 

m r 

E{T,W)= min J^U - V v fi\\ 2 = V \j. (4.3) 

4.1. Best Non-Linear Approximation by Bundles of Subspaces in M. N . 

Let T — . . • , / m } be a set of vectors in K. and n < m. As indicated before, 
Theorem |4. 1 1 states precisely that £„ has the MAP. 

Define again S = 58(2) to be the set of non-empty bundles of length I in £„, i.e. 
58 = {{Vi, Vi} : Vi G £„, i = l,...,i}. 



We then have the following theorem. 
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Theorem 4.2. Let T = {/i, . . . , /,„} be vectors in R , and let I and n be given 
(I < m, n < N), then there exist a bundle V = {V^ , . . . , VJ } € 03, such that 

n 

e(F,V ) = V min ^(/ l ,T/°)=inf{ e (^,V) :Ve23}. 



2=1 

Let P = {J 7 ?, be the best partition of T associated to the optimal bundle 

V = {V?, V[ } (i.e. P = {Tl, ■ ■ ■ ,rf} € fi z (V ), is such that fj e J? implies 
d{fj,V?)<dUi,Vg), k = l,...,l). 



Now, using Theorem 4.1 for each h = and such that ^ 0, a set of 

generators forming an orthonormal base can be obtained for the optimal space V® 
in terms of the singular values and singular vectors of the matrix Ah associated to 



the subset T\. Furthermore, by (4.3 1, a formula for the minimum error E(T\ , V°), 



and therefore for e(J r , Vq), is given in terms of the singular values of Ah- 



Remark. Theorem 4.2 remains true if we replace the set £„ by the set £(n 1 ,....n i ) of 
bundles {Vi, . . . , Vi\ such that dim V$ < rii, for i = 1, . . . , I. 

4.2. Sparsity and Optimal Dictionaries. Now, we will describ e th e relation 
between the solution to Problem [l] for R N , as described in Section 



4.1 



with the 

problem of dictionary finding and sparsity. Let us introduce the following problem. 
See for example |AEB06a] . 



Problem 2. Given data T = {f±, f m } in R N and positive integers n and d, find 
a dictionary D (i.e. a set of vectors in K ) of length at most d, such that each fi 
can be written as a linear combination of at most n atoms in D. 

That is, (in matrix notation) find a N x r matrix D, with columns a\, a r in M. N 
and r < d, such that there exist an r x m matrix X , with columns x%, x m in W , 
such that T = DX with ||xi||o < n, for i = 1, m. 

Now we will introduce a definition of sparsity and show its connection with Problem 

HI 

Definition 4.3. Let n,l,m be positive integers, with n,l < m. 
Given a set of vectors T — {fx, f m } in M. N and a real number e > 0, we will 
say that the data T is (I, n, e)-sparse if there exist subspaces V\, Vi, of R with 
dim(Vi) < n for i = 1, /, such that 

m 

e(T 1 {V 1 ,...,V l })=Y J ^d 2 (h,V ] )<e. (4.4) 

i— 1 — — 

When T is (I, n, 0)-sparse, we will simply say that T is (I, n)-sparse. We will also 
say that the data is e-sparse if the values of I and n are clear from the context. 

Note that if T is (I, n, e)-sparse, then it is also (I, n, ?7)-sparse for every r\ > e. So, 
usually it is interesting to know the minimun e such that the data is e-sparse. The 
above definition of sparsity is an intrinsic property of the data and the Hilbcrt space 
in which the data lives, and does not depend on any specific dictionary. 
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4.2.1. The case e = 0. Let us now consider the case e = 0. 

If the data T is (I, n)-sparse and V] ,...,^ are optimal spaces (that is when 
efTjfVi ,...,^ }) = and dim^ ) < n for i = then each / G T be- 

longs to some of the spaces {I^°}i=i,...,z. 

For each % = 1, Z, let us call r*j = dim(VP) and let {ma, Wi ri } be an orthonor- 
mal basis of VP. Define 



D=\J{w ij :l<j<r i }c 



The vectors in D have the property that each / e f can be written as a linear 
combination of at most n elements in D. In other words, we have found a dictionary 
D such that it solves Problem [2] for the data T and length s with s = r\ + • • • + r; < 
Zn. 

So Theorem |4.2| provides a solution to Problem [2] with a dictionary of length at 
most In, in case that the data is (Z, n)-sparse. We will see below that if the data T 



is not (I, n)-sparse, then Theorem 4.2 provides the minimum s such that the data 
is (I, n, e)-sparse. 

Note that if the basis of each V® is properly chosen then in many cases the number 
of atoms in the dictionary can be reduced, due to the fact that the subspaces can 
have non-trivial intersections. 

So, as before, let VP, VP be optimal spaces, and let U — {ui, —,u s } be a set of 
vectors with the property that for each i S {1, 1} there is a subset Ui C U such 
that span(Wi) = V® . Set Dq = {iwi, w So } to be a minimal set with this property. 

Then Theorem |4 . 2| implies that Do is a dictionary that solves Problem [2] for data T 
and positive integers n and d = In. We want to remark here that a minimal set is 
not a linearly independent set in general. It is not difficult to see that considering 
all possible intersections of the subspaces V°, a minimal set can be constructed. 



4.2.2. The case e > 0. If the data is not (I, n)-sparse, then Theorem 4.2 implies 
that there is no dictionary D of length d = In or smaller that solves Problem [2] 
for the data T . 

If we still want to find a dictionary of length no larger than In with ||xj||o < n 
for i — X,...,n, then the question is: what error do we have to allow in order to 
have a solution? In other words, what is the minimum e such that the data T is 
(I, n, ff)-sparse? This question gives rise to the following extension of Problem [2] 

Problem 3. Let J- = {fx, f m } in R w , and n, d positive integers. With the same 
notation as in Problem^ given e > 0, find a dictionary D with no more than d 
atoms and a matrix X such that 

\\T-DX\\ < e 

with ||:Ej||o < n, for i = 1, m. 

Theorem |4.2| provides in this case the solution for the minimum possible error and 
establishes the exact value of the error. More precisely, let VP, be a bundle 
of optimal subspaces and let e = e(F, { V] , VJ }). Let us choose as before a 
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minimal set Dq — {wi, w So } for that solution, then we have that there exist 
vectors xi, x m in R s ° such that 

\\T-D X\\=e 

with ||a;i||o < n, for i = 1, m. Furthermore, given any N xr matrix D with r < sq 
and any matrix X with columns X\, ■■■,x m in K r and ||^i||o < n f° r i = lj ...,m, we 
have 

> e. 

So, a solution of Problem [l] gives an optimal solution for Problem [3] and finds the 
exact (optimal) e-sparsity. 

Note that when n or I increase then in general the minimum error will decrease. It 
is also important to emphasize here that some subspaces from an optimal bundle 
for the data T and (I, n) can have dimension zero, so in applications these subspaces 
can be removed and the non-trivial subspaces will produce the same error. Thus, 
the subspaces that are found are not restricted to be orthogonal, or with equal 
dimensions or with trivial intersection, and there can be any number of subspaces 
up to a prescribed number I, and each subspace can be of any dimension up to a 
prescribed number n. 



5. Search Algorithm 



Although Theorem 2.2 establishes the existence of a global minimizer solution to 
Problem[l] an exhaustive search over all possible partitions is not feasible in practice 
and a search algorithm is needed. Lemma |2.1| used in the proof Theorem |2.2| 
suggests an iterative search algorithm that we will present in this section, and we 
will show that the algorithm always terminates in finitely many steps. The search 
algorithm for finding the solution to Problem [l] is given in the program below, 
with the notation of Section [2j It uses two choice functions G,H, where G is a 
choice function assigning P i — ► Vp £ W(P), and H is a choice function assigning 

V^Pefi,(V). 

Algorithm. 

(1) Pick any partition Pi £ II/; 

(2) Find and choose V Pl = G(P X ) £ VV(Pi) C 93 by minimizing r(P x , V) over 
V £ 23; 

(3) Set j = 1; 

(4) While r(P,,V Pj ) > e(r,V Pj ) ; 

(5) Choose a new partition Pj+i = H(Vp.) £ ili(Vp ) associated to Vp,; 

(6) Find and choose Vp. +1 = G(P,-) £ W(Pj), by minimizing r(Pj+i, V) over 
VeQ3; 

(7) Increase j by 1, i.e., j -> j + 1; 

(8) End while 

Note that this algorithm, starting from a bundle V Pl in step (2), produces a se- 
quence of bundles Vp 1} Vpj, Vp 3 , . . . with the property that that e[T , VpJ > 

e(jF, Vp 2 ) > e(J r , Vp 3 ) > The algorithm stops precisely when for some j > 1 

e(!F, Vp.) = e{T , V P . +1 ). We will now see that the algorithm terminates in finitely 
many steps. 
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Proof. We first note that if T(Pj,V Pj ) > e{F,V Pj ), then P j+1 £ fti{V Pj ). To 
see this, we argue by contradiction: if Pj+i £ ili(V P ), then r(P 3+1 ,Vp +1 ) = 
T(Pj , V P . ) . But we have that 

T(P j+1 ,V Pj+1 ) < e(f,V Pj ) < T{Pj, V P .), 

which is a contradiction. 

Therefore, since the set of partitions is finite, the algorithm must stop in at most 
#Hi steps. The algorithm terminates when T(P en d, Vp rn J = e(f, Vp tnii ). □ 

Note. A partition P m such that r(P m ,Vp m ) = e(jF, Vp m ) will be called a minimal 
partition (see remark below). 

(1) The algorithm can be formulated as a search for minimal partitions in 
the partially ordered set (IT,^), where the order of the elements in II; 
depends on the specific choice functions G,H in (2), (5) and (6) of the 
algorithm. Specifically, P < Q \i there exists an integer s > 1 such that 
P = (HG) S Q. Since (II/, r^) is a partially ordered set with finitely many 
elements, a nonempty set of minimal partitions M C 11/ exists. 

(2) The algorithm will always terminate in finitely many steps but the bundle 
Vp m associated to the final minimal partition P m may not be the global 
minimizer of Problem [l] The algorithm can be viewed as a search in a 
directed graph whose vertices are the partitions. Each iteration moves 
from one partition to the next via a directed edge. If the graph has a 
single component, then the algorithm will always end at a partition whose 
associated bundle is a global minimizer. However, if the graph has more 
than one component, then the algorithm will end up at a partition, whose 
associated bundle is not necessarily the global minimizer. 

(3) In the search algorithm, steps (2) and (6) must be implemented by some 
other minimizing algorithms. For the two cases that we studied in this 
paper, a space Vp. +1 that minimizes r(P J+ i, V) over V £ 03 can be explic- 
itly found and computed, by the Eckhard- Young Theorem for subspaces 
of H = R N , and by Theorem 2.1 ( |ACHM07p for shift-invariant spaces of 
H = L 2 (R d ). Both methods are based on the Singular Value Decomposi- 
tion, they are easily implemented, and all the approximation errors can be 
computed exactly. 

(4) In searching for a dictionary with d atoms such that each data point 
/ e M. N from a set of data T = {/i, . . . , f m } is approximated by a sin- 
gle atom, our method coincides with the K-SVD algorithm proposed in 
( AEB06b. AEB06a]) and produces the same dictionary if steps (2) and (6) 
are implemented using the SVD. However, for the case where each data 
point is approximated by a linear combination of n > 1 atoms, the two 
methods are not comparable, even if steps (2) and (6) are implemented 
using the SVD. 
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6. Conclusions 



Theorem 2.2 can be viewed as a way of finding an optimal (generaly non-linear) 
signal model of the form Ll l i=1 Si from some observed data. For example, the ap- 
plication of Theorem |2.2| to shift-invariant spaces in Section |3.1| produced Theo- 
rem |3.3| which gives the optimal signal model of at most I shift-invariant spaces 
compatible with the observed data. The resulting solution is an optimal bundle 
So = {Si , ■ ■ • , S®} that consists of a finite sequence of infinite dimensional spaces, 
such that each space Sf of this sequence is generated by the integer translates of 
at most n generators. This type of best signal model U^S* derived from a set of 
observed data T = {/i, . . . , f m } C L 2 may be used for example in sampling and 
reconstruction as well as other applications. 

If applied to R N , Theorem 2.2 can be viewed as a way of finding an optimal 



sparse representation of the data T C l", optimal dictionaries, and subspace 
segmentation in the pr esen ce of noise as discussed in Section |4.2| Specifically, the 



application of Theorem|2.2|to the finite dimensional case produces Theorem 



4.2 



which gives the optimal £-sparse representation of the data as discussed in Section 
|4.2| In particular, Theorem |4.2| proves the existence of a dictionary with minimal 
error and minimal length for sparse data representation. 

One contribution of this work is that it unifies and complements some of the new 
non-linear techniques used in sampling theory, the Generalized Principle Compo- 
nents Analysis, and the dictionary design problem. However, there are still many 
questions that need to be addressed before this methodology becomes applicable. 
For example, the algorithm proposed in the last section may end up at a local 
minimum which is not a global minimum. The termination of the algorithm de- 
pends on the initial condition. Thus, an interesting question is to estimate the 
number of minima in terms of some characteristics of C and T . Another interest- 
ing question is to estimate the speed at which the algorithm converges in terms 
of some characteristics of C and J- . Testing the dependence of algorithm on the 
initial partition, the data, and noise level, for the case Ti = H. and C — £„ using 
an SVD implementation for steps (2) and (6) is also important for future research, 
and applications. 
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